#7 Working with (your own) Data
Faculty of Humanities and Social Sciences
University of Lucerne
11 April 2024
introducing Python 🐍
learning programming concepts & syntax
working with VS Code Editor
.txt or .pdf format ✅
👉 basically, any textual documents…
.csv 🥰
.txt 🙂
.pdf 😬
.csv ready off-the-shelf
😓 There are still not many.
👉 search for a topic followed by corpus, text collection or text as data
Make your (Google) web search more efficient by using dedicated tags. Examples:
"computational social science"
site:nytimes.com
nature OR environment
.csv
👉 check out other resources licensed by ZHB
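The search tags from the examples above can also be combined into a single query. The following is a minimal Python sketch, not an official search API: the helper name build_query and the Google URL pattern are assumptions made for illustration.

```python
from urllib.parse import quote_plus


def build_query(phrase=None, site=None, any_of=()):
    """Combine search tags into one query string (hypothetical helper)."""
    parts = []
    if phrase:
        # Double quotes ask for the exact phrase.
        parts.append(f'"{phrase}"')
    if any_of:
        # OR matches pages that contain either term.
        parts.append(" OR ".join(any_of))
    if site:
        # site: restricts the results to a single domain.
        parts.append(f"site:{site}")
    return " ".join(parts)


query = build_query(
    phrase="computational social science",
    site="nytimes.com",
    any_of=("nature", "environment"),
)
print(query)

# Assumed URL pattern for opening the query in a browser.
print("https://www.google.com/search?q=" + quote_plus(query))
```

The printed URL can be pasted into a browser; typing the query string directly into the search field works just as well.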
Use wget to download any file from the internet
# get a single file
wget EXACT_URL
# get all linked PDFs from a single webpage
wget --recursive --accept pdf -nH --cut-dirs=5 \
    --ignore-case --wait 1 --level 1 --directory-prefix=data \
    https://www.bk.admin.ch/bk/de/home/dokumentation/abstimmungsbuechlein.html
# --accept FORMAT_OF_YOUR_INTEREST
# --directory-prefix YOUR_OUTPUT_DIRECTORY
noise in text
archive holes
selective corpus curation
social bias
👉 think about the data and mitigate issues
⬇️
digital native documents .pdf, .docx, .html
⬇️
convert to .txt
⬇️
scans of (old) documents .pdf, .jpg, .png
⬇️
Optical Character Recognition (OCR)
machine-readable ✅
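For digital-native documents, the "convert to .txt" step can be sketched with Python's standard library. The example below handles HTML only and is an illustration under assumptions: the class TextExtractor, the function html_to_text and the sample string are hypothetical, and the course scripts may take a different approach. Formats such as .pdf or .docx, and OCR for scans, need additional tools that are not shown here.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text and skip the content of <script> and <style>."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep non-empty text that is not inside a skipped element.
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def html_to_text(html):
    """Return the visible text of an HTML string, separated by spaces."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


# Hypothetical sample document.
sample = (
    "<html><head><style>p {color: red}</style></head>"
    "<body><h1>Title</h1><p>Some <b>bold</b> text.</p></body></html>"
)
print(html_to_text(sample))
```

The result can be written to a .txt file with open("output.txt", "w", encoding="utf-8"), where the file name is a placeholder.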
Illustration of text analysis generated by Image Creator from Microsoft Copilot
Update your local copy of the course materials with git pull. Check out the data samples in ked2024/materials/data and the scripts to extract their text in ked2024/materials/code.
Use wget to download cogito and its predecessor uniluAKTUELL issues (PDF files) from the UniLu website. Start by downloading a single issue, then try to automate the process to download all the listed issues using arguments for the wget command.
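Before automating a download, it can help to check which PDF links a page actually contains. The following Python sketch lists them using only the standard library. It is not a replacement for the wget arguments the exercise asks for: the names PdfLinkCollector and list_pdf_links, the sample HTML and the example.org address are all hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class PdfLinkCollector(HTMLParser):
    """Collect absolute URLs of all links that point to a PDF file."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        # Case-insensitive match on the path, ignoring any query string.
        if href and href.lower().split("?")[0].endswith(".pdf"):
            # Resolve relative links against the address of the page.
            self.links.append(urljoin(self.base_url, href))


def list_pdf_links(html, base_url):
    collector = PdfLinkCollector(base_url)
    collector.feed(html)
    return collector.links


# Hypothetical page content; a real page would first have to be fetched.
page = """
<a href="/files/issue_01.pdf">Issue 1</a>
<a href="issue_02.PDF">Issue 2</a>
<a href="/about.html">About</a>
"""
links = list_pdf_links(page, "https://example.org/magazine/index.html")
for link in links:
    print(link)
```

Each printed address could then be passed to wget as the EXACT_URL of a single file.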